
Conversation


pujaltes commented May 16, 2024

Like gloo, the ccl backend does not appear to support the ReduceOp.AVG operation (see the example below). To avoid errors when using the avg operation to reduce across devices, I simply extended the checks PL already had in place for gloo.
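For context, the change is essentially just widening the existing backend check in Lightning's reduction helper so that ccl, like gloo, falls back to SUM followed by a division by the world size. A minimal sketch of that idea (not the exact diff; the surrounding variable names are assumptions):

if isinstance(reduce_op, str):
    reduce_op = "avg" if reduce_op == "mean" else reduce_op
    # ccl, like gloo, has no ReduceOp.AVG, so sum and divide by world size instead
    if reduce_op.lower() == "avg" and torch.distributed.get_backend(group) in ("gloo", "ccl"):
        op = ReduceOp.SUM
        divide_by_world_size = True
    else:
        op = getattr(ReduceOp, reduce_op.upper())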

Example of the ccl backend error when running mpiexec -n 4 python mpitest.py:

import torch
from torch.distributed.distributed_c10d import _get_default_group
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the 'ccl' distributed backend)
import os
from lightning.fabric.utilities.types import ReduceOp


os.environ["MASTER_ADDR"] = "pvc-s-162"
os.environ["MASTER_PORT"] = "29502"
os.environ["RANK"] = os.environ.get("PMI_RANK", "0")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")
init_method = "env://"
print(f"RANK: {os.environ['RANK']}, WORLD_SIZE: {os.environ['WORLD_SIZE']}", flush=True)

torch.distributed.init_process_group(backend='ccl', init_method=init_method)
test_tensor = torch.rand(1, 1, 40966, dtype=torch.float16, device=f"xpu:{os.environ['RANK']}")
print(f"Device: {test_tensor.device}", flush=True)

# NOTE: The error occurs regardless of how you define the process group
group = torch.distributed.group.WORLD
# group = _get_default_group()

# op = ReduceOp.SUM  # Fine
op = ReduceOp.AVG  # Error
torch.distributed.all_reduce(test_tensor, group=group, async_op=False, op=op)
print("DONE!", flush=True)

📚 Documentation preview 📚: https://pytorch-lightning--8.org.readthedocs.build/en/8/

jingxu10 and others added 2 commits April 28, 2024 05:57
[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update typos and bug fixes

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

xpu seeding PR1

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

add seeding for pytorch utilities

mp_fabric xpu forking

xpu multiprocess pytorch

add header for xpu

rename

change to lightning.pytorch

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Teardown from lightning-xpu (from PR #3)

From Lightning-AI#3

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

add torch.xpu.stream to ddp

update docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update _LIGHTNING_XPU_AVAILABLE to _lightning_xpu_available

correct fabric imports.py

1. remove xpu.py from _graveyard
2. correct _lightning_xpu_available() usage

fix _try_import function not defined issue in fabric

add docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix circle import issue

update pytorch trainer connector

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

correct usage in multiprocessing

Fix precision device

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

update warning format
